ENH: Adds selective hyperparameter optimization #58
Conversation
Also fixes double-dipping problems in model fitting
For the most part LGTM. Minor changes are required to ensure edge cases are well handled and that users understand the usage of hyperparameter optimization.
err_msg = (
    "Hyperparameter tuning parameters must be in the format "
    "param1=value1_low,value1_high param2=value2_low,value2_high..."
)
You should also check that exactly two values, a low and a high, are always given. What happens if someone gives C=1,2,3? This should throw an error, as the user should write C=1,3 and ht=3 to get the 3 hyperparameter points 1, 2, 3. Also, ht should be a minimum of 2.
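The validation asked for above can be sketched as follows. This is illustrative only, not the actual ageml implementation; the function names and error messages are assumptions.

```python
# Sketch of the requested validation (names are illustrative, not ageml's code).
def parse_numeric_range(arg: str):
    """Parse 'C=1,3' into ('C', (1.0, 3.0)); reject anything but two values."""
    name, _, values = arg.partition("=")
    parts = values.split(",")
    if len(parts) != 2:
        raise ValueError(
            f"Hyperparameter '{name}' must be given as low,high "
            f"(e.g. C=1,3); got {len(parts)} values."
        )
    low, high = (float(p) for p in parts)
    if low >= high:
        raise ValueError(f"Hyperparameter '{name}': low must be < high.")
    return name, (low, high)


def check_grid_points(ht: int):
    """A grid search needs at least 2 points per parameter."""
    if ht < 2:
        raise ValueError("Hyperparameter tuning requires at least 2 grid points.")
```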
What about hyperparameters like kernels, where you want to choose between different kernels? These should not be affected by ht.
Addressed! Thanks for this.
src/ageml/messages.py
Outdated
"a model. The parameter ranges are predefined. An integer is required.\n"
"(e.g. -ht 100 / --hyperparameter_tuning 100)"
"a model, and parameter ranges to sample from. An integer is required, followed \n"
"by the parameters to optimize. (e.g. -ht 100 C=1,2,3 kernel=linear,rbf)"
Should this not be only C=1,2, a low and a high value? Also, how do you deal with someone putting 3 kernel types? And 100 hyperparameter grid points is unrealistic; an example that large may lead users to put too many.
True, changed the C=1,2,3 to C=1,3. Also changed 100 to 10.
Working on dealing with string hyperparams.
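One way to deal with string hyperparameters such as kernels is to treat any comma-separated list that does not parse as numbers as a categorical choice, which the ht grid-point count does not affect. A minimal sketch, with assumed names, not the ageml source:

```python
# Illustrative sketch: values that don't parse as floats become a categorical
# choice (unaffected by -ht); numeric values must be exactly a low,high pair.
def parse_hyperparam(arg: str):
    name, _, values = arg.partition("=")
    parts = values.split(",")
    try:
        numbers = [float(p) for p in parts]
    except ValueError:
        # e.g. kernel=linear,rbf -> use all listed values as the grid
        return name, {"type": "categorical", "values": parts}
    if len(numbers) != 2:
        raise ValueError(f"{name}: numeric ranges must be 'low,high'")
    return name, {"type": "range", "low": numbers[0], "high": numbers[1]}
```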
for hyperparam_name in list(self.hyperparameter_params.keys()):
    low, high = self.hyperparameter_params[hyperparam_name]
    if hyperparam_types[hyperparam_name] == "log":
        param_grid[f"model__{hyperparam_name}"] = np.logspace(low, high, int(self.hyperparameter_tuning))
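The log/linear distinction in the snippet above can be isolated as a small helper. This is a sketch under the assumption that a per-parameter scale flag (like the `hyperparam_types` dict above) decides the spacing; the function name is hypothetical:

```python
import numpy as np

# Sketch: map a (low, high) bound pair to a grid, depending on the scale flag.
# With scale='log', the bounds are exponents: C=-3,3 means 1e-3 ... 1e3.
def build_grid(low, high, n_points, scale):
    if scale == "log":
        return np.logspace(low, high, n_points)
    return np.linspace(low, high, n_points)
```

This is why documenting the scale per parameter matters: `C=-3,3` on a log scale and `C=0.001,1000` on a linear scale describe very different grids.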
Where are you telling the users which parameters are logs, ints, etc.? For example, if I choose C=-3,3, how do I know if it is using log? Or should I have to write C=0.001,100?
I will add it in the documentation. And a link to the documentation explaining this.
src/ageml/modelling.py
Outdated
self.pipeline.set_params(**opt_pipeline.best_params_)
if self.model_type != "hyperopt":
    # Optimize hyperparameters if required
    if self.hyperparameter_tuning > 0:
Hyperparameter tuning should require at least 2 points. Will this give an error if you have C=1,2 but only one grid point? Will it choose C=1 or C=2?
True. Missed this, changing it to 2.
print("Test: MAE %.2f ± %.2f, RMSE %.2f ± %.2f, R2 %.3f ± %.3f, p %.3f ± %.3f" % tuple(summary_test))

# Select the best pipeline based on the mean scores
best_hyperparam_index = np.argmin(mae_means_test)
This can cause memory problems, because you are keeping all the predicted and corrected ages until you have trained all the models. You should keep the scores of all the different parameter grids, but at each iteration of the grid keep the predicted and corrected ages only if the model is better than the previous best.
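The pattern suggested here keeps memory bounded: all scores are retained, but only one set of predictions is alive at a time. A minimal sketch with illustrative names (`grids`, `fit_and_predict` are not from the ageml code):

```python
import numpy as np

# Sketch of the suggestion: keep every score, but only the best model's
# predictions, so memory does not grow with the number of grid points.
def grid_search_keep_best(grids, fit_and_predict):
    best_mae, best_preds, all_maes = np.inf, None, []
    for grid in grids:
        preds, mae = fit_and_predict(grid)   # train one grid point
        all_maes.append(mae)                 # scores kept for every grid
        if mae < best_mae:                   # predictions kept only if better
            best_mae, best_preds = mae, preds
    return best_preds, all_maes
```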
True, solving it.
if self.args.model_type == "hyperopt":
    # Split the data in training and test sets
    # TODO: Substitute 0.2 factor by the user CLI argument
Why is this left as a todo instead of actually implementing it?
time
src/ageml/ui.py
Outdated
@@ -1761,7 +1800,9 @@ def model_age_command(self):
print("Hyperparameter tuning. Number of points in grid search: (Default: 0)")
Should the default, if we are actually running hyperparameter tuning, not be at least 2? Maybe something like 5? If not, should it even be saying hyperparameter tuning?
True, missed this
    test_indices = indices[-test_size:]
else:
    train_indices = list(range(X.shape[0]))
Why do we need these train and test indices? I see that they cover the whole data structure.
The reason behind it is to avoid writing if flow-control statements every time I need to split the dataset depending on the model selection ('hyperopt' or not). It is an efficiency thing.
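The pattern described above can be sketched as a single helper that computes the index lists once, so downstream code indexes `X[train_indices]` uniformly without re-branching on the model type. Function and argument names here are assumptions, not the ageml code:

```python
import numpy as np

# Sketch: compute train/test indices once; non-hyperopt models "split" into
# the whole dataset, so later code needs no if/else on the model type.
def split_indices(n_samples, model_type, test_fraction=0.2, rng=None):
    rng = rng if rng is not None else np.random.default_rng(0)
    if model_type == "hyperopt":
        indices = rng.permutation(n_samples)
        test_size = int(n_samples * test_fraction)
        return list(indices[:-test_size]), list(indices[-test_size:])
    # non-hyperopt: train and test are both the full index range
    all_indices = list(range(n_samples))
    return all_indices, all_indices
```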
Only the results of the best model are stored, to avoid using memory for predicted ages that would be discarded later anyway. Also adds a link to the documentation explaining the 'log'/'int'/'float' handling of hyperparameter range specification.
Summary
Users can now specify which parameters of the selected model are hyperoptimized. This way, for models with many hyperparameters, the search space is greatly reduced, whereas previously we would train $n^g$ models ($n$: number of model parameters, $g$: hyperparameter grid points). Also fixes double-dipping problems in model fitting.
Key Changes
- `hyperopt` model now trains on a dataset split instead of CV due to `hyperopt-sklearn` library limitations. The testing set indices for each model (dependent on covariates and systems) are returned in CSV files under the `ageml/model_age` directory.
- `AgeML.fit_age` method now has 2 train cases instead of 3, making it more concise and efficient.
- `-ht` argument.

Other changes
- `rng` from the `ageml.datasets` module is used in the `ui` module to do an index random permutation, instead of using `np.random`, which is discouraged.